Abstract

The policy gradient approach is a flexible and powerful reinforcement learning 
method particularly for problems with continuous actions such as robot control. 
A common challenge in this scenario is how to reduce the variance of policy gradi- 
ent estimates for reliable policy updates. In this paper, we combine the following 
three ideas and give a highly effective policy gradient method: (a) the policy gra- 
dients with parameter based exploration, which is a recently proposed policy search 
method with low variance of gradient estimates, (b) an importance sampling tech- 
nique, which allows us to reuse previously gathered data in a consistent way, and (c) 
an optimal baseline, which minimizes the variance of gradient estimates with their 
unbiasedness being maintained. For the proposed method, we give theoretical anal- 
ysis of the variance of gradient estimates and show its usefulness through extensive 
experiments. 



1 Introduction 

The objective of reinforcement learning (RL) is to let an agent optimize its decision- 
making policy through interaction with an unknown environment [25]. Among possible 
approaches, policy search has become a popular method because of its direct nature for 
policy learning [1]. In particular, in high-dimensional problems with continuous states and
actions, policy search has been shown to be highly useful in practice [14, 16].

Among policy search methods [3] , gradient-based methods are popular in physical con- 
trol tasks because policies are changed gradually [261 QUI EE] and thus steady performance 
improvement is ensured until a locally optimal policy has been obtained. However,
the gradients estimated with these methods tend to have large variance, and thus these
methods may suffer from slow convergence.

Recently, a novel approach to using policy gradients called policy gradients with
parameter based exploration (PGPE) was proposed [20]. PGPE tends to produce gradient
estimates with low variance by removing unnecessary randomness from policies and
introducing useful stochasticity by considering a prior distribution for policy parameters.
PGPE was shown experimentally to be more promising than alternative approaches
[20, 32]. However, PGPE still requires a relatively large number of samples to obtain
accurate gradient estimates, which can be a critical bottleneck in real-world applications
where data collection is costly and time-consuming.

To overcome this weakness, an importance sampling technique [7] is useful under the 
off-policy RL scenario, where a data-collecting policy and the current target policy are 
different in general [25] . An importance sampling technique allows us to reuse previously 
collected data, which are collected following policies different from the current one in a 
consistent manner [2SJ [22]- However, naively using an importance sampling technique 
significantly increases the variance of gradient estimates, which can cause sudden changes 
in policy updates [2H UHl EJ [28] - To mitigate this problem, variance reduction techniques 
such as decomposition [TS] , truncation [2EJ I2Z] , normalization [2U H5] , and flattening [S] 
of importance weights are often used. However, these methods commonly suffer from the
bias-variance trade-off, meaning that the variance is reduced at the expense of increasing
the bias.

The purpose of this paper is to propose a new approach that systematically addresses
the large variance problem in policy search. This work is an extension of
our previous research [32] to an off-policy scenario using an importance weighting
technique. More specifically, we first give an off-policy implementation of PGPE called the
importance-weighted PGPE (IW-PGPE) method for consistent sample reuse. We then
derive the optimal baseline for IW-PGPE to minimize the variance of importance-weighted
gradient estimates, following [8, 29]. We show that the proposed method can achieve
significant performance improvement over alternative approaches in experiments with an
artificial domain. We also show that combining the proposed method with the
truncation technique can further improve the performance in high-dimensional problems.

2 Formulations of Policy Gradient 

In this paper, we consider the standard framework of episodic reinforcement learning (RL) 
in which an agent interacts with an environment modeled as a Markov decision process 
(MDP) [21]. In this section, we first review a standard formulation of policy gradient 
methods J2U QUI US]. Then we show an alternative formulation adopted in the PGPE 
(policy gradients with parameter based exploration) method [20].




2.1 Standard Formulation 

We assume that the underlying control problem is a discrete-time MDP. At each discrete
time step $t$, the agent observes a state $s_t \in \mathcal{S}$, selects an action
$a_t \in \mathcal{A}$, and then receives an immediate reward $r_t$ resulting from a state
transition in the environment. The state space $\mathcal{S}$ and the action space
$\mathcal{A}$ are both continuous in this paper¹. The dynamics of the environment are
characterized by $p(s_{t+1}|s_t, a_t)$, which represents the transition probability
density from the current state $s_t$ to the next state $s_{t+1}$ when action $a_t$ is
taken, and $p(s_1)$ is the probability density of initial states. The immediate reward
$r_t$ is given according to the reward function $r(s_t, a_t, s_{t+1})$.

1 Note that the continuous formulation is not an essential restriction.

The agent's decision making procedure at each time step $t$ is characterized by a
parameterized policy $p(a_t|s_t, \theta)$ with parameter $\theta$, which represents the
conditional probability density of taking action $a_t$ in state $s_t$. We assume that the
policy is continuously differentiable with respect to its parameter $\theta$.

A sequence of states and actions forms a trajectory denoted by

$$h := [s_1, a_1, \ldots, s_T, a_T],$$

where $T$ denotes the number of steps, called the horizon length. In this paper, we assume
that $T$ is a fixed deterministic number. Note that the action $a_t$ is chosen independently
of the trajectory given $s_t$ and $\theta$. Then the discounted cumulative reward along $h$,
called the return, is given by

$$R(h) := \sum_{t=1}^{T} \gamma^{t-1} r(s_t, a_t, s_{t+1}),$$

where $\gamma \in [0, 1)$ is the discount factor for future rewards.
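
As a small illustration of the return, the following Python sketch computes the discounted sum $\sum_{t=1}^{T} \gamma^{t-1} r_t$ for a list of rewards. The function name `discounted_return` and the convention that the first reward is undiscounted are assumptions made for this example; the convention is chosen to be consistent with the factor $(1 - \gamma^T)/(1 - \gamma)$ that appears in the variance bounds of Section 3.

```python
def discounted_return(rewards, gamma):
    """Return sum_{t=1}^{T} gamma^(t-1) * r_t for one trajectory.

    rewards: list of immediate rewards r_1, ..., r_T (assumed layout).
    gamma:   discount factor in [0, 1).
    """
    total = 0.0
    discount = 1.0  # equals gamma^(t-1), starting at t = 1
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total


# With a constant reward of 1, the return reduces to the geometric sum
# (1 - gamma^T) / (1 - gamma).
constant_reward_return = discounted_return([1.0] * 10, 0.9)
```

With a bounded reward $|r| \le \beta$, the same geometric sum bounds $|R(h)|$ by $\beta (1 - \gamma^T)/(1 - \gamma)$.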

The goal is to optimize the policy parameter $\theta$ so that the expected return is maximized.
The expected return for policy parameter $\theta$ is defined by

$$J(\theta) := \int p(h|\theta) R(h) \, dh,$$

where

$$p(h|\theta) = p(s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t) \, p(a_t|s_t, \theta).$$

The most straightforward way to update the policy parameter is to follow the gradient in
policy parameter space using gradient ascent:

$$\theta \longleftarrow \theta + \varepsilon \nabla_\theta J(\theta),$$

where $\varepsilon$ is a small positive constant, called the learning rate.

This is a standard formulation of policy gradient methods [HH EEl EE]- The central 
problem is to estimate the policy gradient $\nabla_\theta J(\theta)$ accurately from trajectory samples.

2.2 Alternative Formulation 

However, standard policy gradient methods were shown to suffer from high variance in
the gradient estimation due to randomness introduced by the stochastic policy model
$p(a|s, \theta)$ [32]. To cope with this problem, an alternative method called policy gradients
with parameter based exploration (PGPE) was proposed recently [20]. The basic idea of
PGPE is to use a deterministic policy and introduce stochasticity by drawing parameters
from a prior distribution. More specifically, parameters are sampled from the prior
distribution at the start of each trajectory, and thereafter the controller is deterministic².
Thanks to this per-trajectory formulation, the variance of gradient estimates in PGPE
does not increase with respect to trajectory length $T$. Below, we review PGPE.
PGPE uses a deterministic policy with typically a linear architecture:

$$p(a|s, \theta) = \delta(a = \theta^\top \phi(s)), \qquad (1)$$

where $\delta(\cdot)$ is the Dirac delta function, $\phi(s)$ is an $\ell$-dimensional basis
function vector, and $^\top$ denotes the transpose. The policy parameter $\theta$ is drawn
from a prior distribution $p(\theta|\rho)$ with hyper-parameter $\rho$.

The expected return in the PGPE formulation is defined in terms of expectations over
both $h$ and $\theta$ as a function of the hyper-parameter $\rho$:

$$J(\rho) := \iint p(h|\theta) \, p(\theta|\rho) \, R(h) \, dh \, d\theta.$$

In PGPE, the hyper-parameter $\rho$ is optimized so as to maximize $J(\rho)$, i.e., the optimal
hyper-parameter $\rho^*$ is given by

$$\rho^* := \arg\max_{\rho} J(\rho).$$

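In practice, a gradient method is used to find $\rho^*$:

$$\rho \longleftarrow \rho + \varepsilon \nabla_\rho J(\rho),$$

where $\nabla_\rho J(\rho)$ is the derivative of $J$ with respect to $\rho$:

$$\nabla_\rho J(\rho) = \iint p(h|\theta) \, p(\theta|\rho) \, \nabla_\rho \log p(\theta|\rho) \, R(h) \, dh \, d\theta. \qquad (2)$$

Note that, in the derivation of the gradient, the logarithmic derivative,

$$\nabla_\rho \log p(\theta|\rho) = \frac{\nabla_\rho p(\theta|\rho)}{p(\theta|\rho)},$$

was used. The expectations over $h$ and $\theta$ are approximated by the empirical averages:

$$\nabla_\rho \widehat{J}(\rho) = \frac{1}{N} \sum_{n=1}^{N} \nabla_\rho \log p(\theta_n|\rho) R(h_n), \qquad (3)$$

where each trajectory sample $h_n$ is drawn independently from $p(h|\theta_n)$ and parameter
$\theta_n$ is drawn from $p(\theta_n|\rho)$. We denote samples collected at the current iteration as

$$D = \{(\theta_n, h_n)\}_{n=1}^{N}.$$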



2 Note that transitions are stochastic, and thus trajectories are also stochastic even though the policy
is deterministic.

Following [20], in this paper we employ a Gaussian distribution as the distribution of
the policy parameter $\theta$ with the hyper-parameter $\rho$. However, other distributions can also
be used. When assuming a Gaussian distribution, the hyper-parameter $\rho$ consists of
a set of means $\{\eta_i\}$ and standard deviations $\{\tau_i\}$, which determine the prior
distribution for each element $\theta_i$ in $\theta$ of the form

$$p(\theta_i|\rho_i) = \mathcal{N}(\theta_i|\eta_i, \tau_i^2),$$

where $\mathcal{N}(\theta_i|\eta_i, \tau_i^2)$ denotes the normal distribution with mean $\eta_i$
and variance $\tau_i^2$. Then the derivatives of $\log p(\theta|\rho)$ with respect to $\eta_i$
and $\tau_i$ are given as

$$\nabla_{\eta_i} \log p(\theta|\rho) = \frac{\theta_i - \eta_i}{\tau_i^2},$$

$$\nabla_{\tau_i} \log p(\theta|\rho) = \frac{(\theta_i - \eta_i)^2 - \tau_i^2}{\tau_i^3},$$

which can be substituted into Eq.(3) to approximate the gradients with respect to $\eta$ and
$\tau$. These gradients give the PGPE update rules.
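
The PGPE estimator with a Gaussian prior can be sketched in a few lines of Python. This is a minimal illustration for a scalar parameter, assuming the returns $R(h_n)$ have already been computed; the names `log_prior_grads` and `pgpe_gradient` are hypothetical and are not taken from the authors' MATLAB code.

```python
def log_prior_grads(theta, eta, tau):
    """Derivatives of log N(theta | eta, tau^2) with respect to eta and tau."""
    d_eta = (theta - eta) / tau ** 2
    d_tau = ((theta - eta) ** 2 - tau ** 2) / tau ** 3
    return d_eta, d_tau


def pgpe_gradient(samples, eta, tau):
    """Empirical average of grad log p(theta_n | rho) * R(h_n), cf. Eq. (3).

    samples: list of (theta_n, return_n) pairs for a scalar parameter,
             where theta_n was drawn from N(eta, tau^2).
    Returns the pair (gradient w.r.t. eta, gradient w.r.t. tau).
    """
    g_eta = 0.0
    g_tau = 0.0
    for theta, ret in samples:
        d_eta, d_tau = log_prior_grads(theta, eta, tau)
        g_eta += d_eta * ret
        g_tau += d_tau * ret
    n = len(samples)
    return g_eta / n, g_tau / n
```

Note that the policy itself is never differentiated here: only the prior over parameters enters the gradient, which is why a non-differentiable controller is admissible.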

An advantage of PGPE is its low variance of gradient estimates: Compared with a
standard policy gradient method REINFORCE [21], PGPE was empirically demonstrated
to be better in some settings [20, 32]. The variance of gradient estimates in PGPE can
be further reduced by subtracting an optimal baseline (Theorem 4 of [32]).

Another advantage of PGPE is its high flexibility: In standard policy gradient methods,
the parameter $\theta$ is used to determine a stochastic policy model $p(a|s, \theta)$, and policy
gradients are calculated by differentiating the policy with respect to the parameter.
However, because PGPE does not need to calculate the derivative of the policy, a
non-differentiable controller is also allowed.

3 Off-Policy Extension of PGPE

In real-world applications such as robot control, gathering roll-out data is often costly. 
Thus, we want to keep the number of samples as small as possible. However, when the 
number of samples is small, policy gradients estimated by the original PGPE are not 
reliable enough. 

The original PGPE is categorized as an on-policy algorithm [25], where data drawn 
from the current target policy is used to estimate policy gradients. On the other hand, 
off-policy algorithms are more flexible in the sense that a data-collecting policy and the 
current target policy can be different. In this section, we extend PGPE to an off-policy 
scenario using importance-weighting, which allows us to reuse previously collected data 
in a consistent manner. We also theoretically analyze properties of the extended method. 

3.1 Importance-Weighted PGPE

Let us consider an off-policy scenario where a data-collecting policy and the current target
policy are different in general. In the context of PGPE, we consider two hyper-parameters,
$\rho$ for the target policy to learn and $\rho'$ for data collection. Let us denote data samples
collected with hyper-parameter $\rho'$ by $D'$:

$$D' = \{(\theta'_n, h'_n)\}_{n=1}^{N'} \sim p(h, \theta|\rho') = p(h|\theta) \, p(\theta|\rho').$$

If we naively use data $D'$ to estimate policy gradients by Eq.(3), we have an inconsistency
problem:

$$\frac{1}{N'} \sum_{n=1}^{N'} \nabla_\rho \log p(\theta'_n|\rho) R(h'_n) \;\overset{N' \to \infty}{\nrightarrow}\; \nabla_\rho J(\rho),$$

which we refer to as "non-importance-weighted PGPE" (NIW-PGPE).

Importance sampling [7] is a technique to systematically resolve this distribution
mismatch problem. The basic idea of importance sampling is to weight samples drawn from a
sampling distribution to match the target distribution, which gives a consistent gradient
estimator:

$$\nabla_\rho \widehat{J}_{\mathrm{IW}}(\rho) := \frac{1}{N'} \sum_{n=1}^{N'} w(\theta'_n) \nabla_\rho \log p(\theta'_n|\rho) R(h'_n) \;\overset{N' \to \infty}{\longrightarrow}\; \nabla_\rho J(\rho),$$

where

$$w(\theta) = \frac{p(\theta|\rho)}{p(\theta|\rho')}$$

is called the importance weight.

An intuition behind importance sampling is that if we know how "important" a sample
drawn from the sampling distribution is in the target distribution, we can make adjustment
by importance weighting. We call this extended method importance-weighted PGPE
(IW-PGPE).
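
To make the weighting concrete, the following Python sketch computes the importance weight $w(\theta) = p(\theta|\rho)/p(\theta|\rho')$ for scalar Gaussian priors and uses it in an importance-weighted estimate of the gradient with respect to the mean $\eta$. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, hyper-parameters are passed as `(eta, tau)` pairs, and returns are assumed to be precomputed.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weight(theta, rho, rho_prime):
    """w(theta) = p(theta | rho) / p(theta | rho') for scalar Gaussian priors.

    rho and rho_prime are (eta, tau) pairs: target and data-collecting
    hyper-parameters, respectively.
    """
    return gaussian_pdf(theta, *rho) / gaussian_pdf(theta, *rho_prime)


def iw_pgpe_gradient_eta(samples, rho, rho_prime):
    """Importance-weighted estimate of the gradient of J w.r.t. eta.

    samples: list of (theta'_n, R(h'_n)) pairs collected under rho_prime.
    """
    eta, tau = rho
    total = 0.0
    for theta, ret in samples:
        w = importance_weight(theta, rho, rho_prime)
        total += w * (theta - eta) / tau ** 2 * ret
    return total / len(samples)
```

When `rho` equals `rho_prime`, every weight is 1 and the estimator coincides with plain PGPE. Dropping the factor `w` while keeping `rho` different from `rho_prime` gives the inconsistent NIW-PGPE estimator.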

Now we analyze the variance of gradient estimates in IW-PGPE. For a multi-dimensional
space, we consider the trace of the covariance matrix of gradient vectors.
That is, for a random vector $A = (A_1, \ldots, A_\ell)^\top$, we define

$$\mathrm{Var}(A) = \mathrm{tr}\left(\mathbb{E}\left[(A - \mathbb{E}[A])(A - \mathbb{E}[A])^\top\right]\right) = \sum_{m=1}^{\ell} \mathbb{E}\left[(A_m - \mathbb{E}[A_m])^2\right], \qquad (4)$$

where $\mathbb{E}$ denotes the expectation.
Let

$$B := \sum_{i=1}^{\ell} \tau_i^{-2},$$

where $\ell$ is the dimensionality of the basis function vector $\phi(s)$. For $\rho = (\eta, \tau)$,
we have the following theorem³:



3 Proofs of all theorems are provided in the Appendix; they are basically extensions of the proofs for the
plain PGPE given in [32] to importance-weighting scenarios.

Theorem 1. Assume that for all $s$, $a$, and $s'$, there exists $\beta > 0$ such that
$r(s, a, s') \in [-\beta, \beta]$, and, for all $\theta$, there exists $0 < w_{\max} < \infty$ such that
$0 < w(\theta) \le w_{\max}$. Then we have the following upper bounds:

$$\mathrm{Var}\left[\nabla_\eta \widehat{J}_{\mathrm{IW}}(\rho)\right] \le \frac{\beta^2 (1 - \gamma^T)^2 B}{N' (1 - \gamma)^2} w_{\max},$$

$$\mathrm{Var}\left[\nabla_\tau \widehat{J}_{\mathrm{IW}}(\rho)\right] \le \frac{2 \beta^2 (1 - \gamma^T)^2 B}{N' (1 - \gamma)^2} w_{\max}.$$

Theorem 1 shows that the upper bound of the variance of $\nabla_\eta \widehat{J}_{\mathrm{IW}}(\rho)$ is proportional
to $\beta^2$ (the upper bound of squared rewards), $w_{\max}$ (the upper bound of the importance
weight $w(\theta)$), $B$ (the trace of the inverse Gaussian covariance), and $(1 - \gamma^T)^2/(1 - \gamma)^2$,
and is inversely proportional to the sample size $N'$. It is interesting to see that the upper bound
of the variance of $\nabla_\tau \widehat{J}_{\mathrm{IW}}(\rho)$ is twice as large as that of $\nabla_\eta \widehat{J}_{\mathrm{IW}}(\rho)$.

It is also interesting to see that the upper bounds are the same as the upper bounds
for the plain PGPE (Theorem 1 of [32]) except for the factor $w_{\max}$; when $w_{\max} = 1$,
the bounds reduce to those of the plain PGPE method. However, if the sampling
distribution is significantly different from the target distribution, $w_{\max}$ can take a large
value and thus IW-PGPE tends to produce a gradient estimator with large variance (at
least in terms of its upper bound). Therefore, IW-PGPE may not be a reliable approach
as it is.

Below, we give a variance reduction technique for IW-PGPE, which leads to a highly 
effective policy gradient algorithm. 



3.2 Variance Reduction by Baseline Subtraction for IW-PGPE 

To cope with the large variance of gradient estimates in IW-PGPE, several techniques have
been developed in the context of sample reuse, for example, by flattening [9], truncating
[28], and normalizing [21] the importance weight. Indeed, from Theorem 1 we can see
that decreasing $w_{\max}$ by flattening or truncating the importance weight reduces the upper
bounds of the variance of gradient estimates. However, all of those techniques are based
on the bias-variance trade-off, and thus they lead to biased estimators.

Another, and possibly more promising variance reduction technique is subtraction of 
a constant baseline [2U HEJJ EJ [29] , which reduces the variance without increasing the bias. 
Here, we derive an optimal baseline for IW-PGPE to minimize the variance, and analyze 
its theoretical properties. 

A policy gradient estimator with a baseline $b \in \mathbb{R}$ is defined as

$$\nabla_\rho \widehat{J}^{b}_{\mathrm{IW}}(\rho) := \frac{1}{N'} \sum_{n=1}^{N'} (R(h'_n) - b) \, w(\theta'_n) \nabla_\rho \log p(\theta'_n|\rho).$$

It is well known that $\nabla_\rho \widehat{J}^{b}_{\mathrm{IW}}(\rho)$ is a consistent estimator of the true gradient for
any constant $b$ [8]. Here, we determine the constant baseline $b$ so that the variance is
minimized, following the line of [32]. Let $b^*$ be the optimal constant baseline for
IW-PGPE that minimizes the variance:

$$b^* := \arg\min_{b} \mathrm{Var}\left[\nabla_\rho \widehat{J}^{b}_{\mathrm{IW}}(\rho)\right].$$

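Then the following theorem gives the optimal constant baseline for IW-PGPE:

Theorem 2. The optimal constant baseline for IW-PGPE is given by

$$b^* = \frac{\mathbb{E}_{p(h,\theta|\rho')}\left[R(h) \, w^2(\theta) \|\nabla_\rho \log p(\theta|\rho)\|^2\right]}{\mathbb{E}_{p(h,\theta|\rho')}\left[w^2(\theta) \|\nabla_\rho \log p(\theta|\rho)\|^2\right]},$$

and the excess variance for a constant baseline $b$ is given by

$$\mathrm{Var}\left[\nabla_\rho \widehat{J}^{b}_{\mathrm{IW}}(\rho)\right] - \mathrm{Var}\left[\nabla_\rho \widehat{J}^{b^*}_{\mathrm{IW}}(\rho)\right] = \frac{(b - b^*)^2}{N'} \mathbb{E}_{p(h,\theta|\rho')}\left[w^2(\theta) \|\nabla_\rho \log p(\theta|\rho)\|^2\right],$$

where $\mathbb{E}_{p(h,\theta|\rho')}[\cdot]$ denotes the expectation of the function of random variables $h$ and $\theta$ with
respect to $(h, \theta) \sim p(h, \theta|\rho')$.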

The above theorem gives an analytic expression of the optimal constant baseline for
IW-PGPE. It also shows that the excess variance is proportional to the squared difference
of baselines $(b - b^*)^2$ and to the expectation of the product of the squared importance weight
$w^2(\theta)$ and the squared norm of the characteristic eligibility $\|\nabla_\rho \log p(\theta|\rho)\|^2$, and is
inversely proportional to the sample size $N'$.
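
In practice the expectations in the optimal baseline have to be replaced by sample averages. The following Python sketch shows one way to do this for a scalar Gaussian prior, together with the baseline-subtracted importance-weighted gradient. It is a hedged illustration: the function names are hypothetical, hyper-parameters are `(eta, tau)` pairs, returns are assumed to be precomputed, and plugging in a sample estimate of $b^*$ is an assumption of this sketch rather than a statement about the authors' code.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def weighted_terms(theta, rho, rho_prime):
    """Return (w, d_eta, d_tau) for one sampled parameter theta."""
    eta, tau = rho
    w = gaussian_pdf(theta, eta, tau) / gaussian_pdf(theta, *rho_prime)
    d_eta = (theta - eta) / tau ** 2
    d_tau = ((theta - eta) ** 2 - tau ** 2) / tau ** 3
    return w, d_eta, d_tau


def optimal_baseline(samples, rho, rho_prime):
    """Sample version of b* = E[R w^2 ||g||^2] / E[w^2 ||g||^2].

    samples: list of (theta'_n, R(h'_n)) pairs collected under rho_prime;
    g denotes the gradient of log p(theta | rho) w.r.t. (eta, tau).
    """
    num = 0.0
    den = 0.0
    for theta, ret in samples:
        w, d_eta, d_tau = weighted_terms(theta, rho, rho_prime)
        scale = w * w * (d_eta ** 2 + d_tau ** 2)
        num += ret * scale
        den += scale
    return num / den


def iw_gradient_with_baseline(samples, rho, rho_prime, b):
    """Importance-weighted gradient estimate with constant baseline b."""
    g_eta = 0.0
    g_tau = 0.0
    for theta, ret in samples:
        w, d_eta, d_tau = weighted_terms(theta, rho, rho_prime)
        g_eta += (ret - b) * w * d_eta
        g_tau += (ret - b) * w * d_tau
    n = len(samples)
    return g_eta / n, g_tau / n
```

The estimated baseline is a weighted average of the observed returns, with weights $w^2 \|\nabla_\rho \log p\|^2$, so samples with large importance weights dominate it.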

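Next, we analyze contributions of the optimal baseline to variance reduction in
IW-PGPE:

Theorem 3. Assume that for all $s$, $a$, and $s'$, there exists $\alpha > 0$ such that
$r(s, a, s') \ge \alpha$, and, for all $\theta$, there exists $w_{\min} > 0$ such that $w(\theta) \ge w_{\min}$.
Then we have the following lower bounds:

$$\mathrm{Var}\left[\nabla_\eta \widehat{J}_{\mathrm{IW}}(\rho)\right] - \mathrm{Var}\left[\nabla_\eta \widehat{J}^{b^*}_{\mathrm{IW}}(\rho)\right] \ge \frac{\alpha^2 (1 - \gamma^T)^2 B}{N' (1 - \gamma)^2} w_{\min},$$

$$\mathrm{Var}\left[\nabla_\tau \widehat{J}_{\mathrm{IW}}(\rho)\right] - \mathrm{Var}\left[\nabla_\tau \widehat{J}^{b^*}_{\mathrm{IW}}(\rho)\right] \ge \frac{2 \alpha^2 (1 - \gamma^T)^2 B}{N' (1 - \gamma)^2} w_{\min}.$$

Assume that for all $s$, $a$, and $s'$, there exists $\beta > 0$ such that $r(s, a, s') \in [-\beta, \beta]$,
and, for all $\theta$, there exists $0 < w_{\max} < \infty$ such that $0 < w(\theta) \le w_{\max}$. Then we have
the following upper bounds:

$$\mathrm{Var}\left[\nabla_\eta \widehat{J}_{\mathrm{IW}}(\rho)\right] - \mathrm{Var}\left[\nabla_\eta \widehat{J}^{b^*}_{\mathrm{IW}}(\rho)\right] \le \frac{\beta^2 (1 - \gamma^T)^2 B}{N' (1 - \gamma)^2} w_{\max},$$

$$\mathrm{Var}\left[\nabla_\tau \widehat{J}_{\mathrm{IW}}(\rho)\right] - \mathrm{Var}\left[\nabla_\tau \widehat{J}^{b^*}_{\mathrm{IW}}(\rho)\right] \le \frac{2 \beta^2 (1 - \gamma^T)^2 B}{N' (1 - \gamma)^2} w_{\max}.$$

This theorem shows that the bounds of the variance reduction in IW-PGPE brought
by the optimal constant baseline depend on the bounds of the importance weight. If
importance weights are larger, using the optimal baseline can reduce the variance more.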

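Based on Theorems 1 and 3, we get the following corollary:

Corollary 4. Assume that for all $s$, $a$, and $s'$, there exists $0 < \alpha < \beta$ such that
$r(s, a, s') \in [\alpha, \beta]$, and, for all $\theta$, there exists $0 < w_{\min} < w_{\max} < \infty$ such that
$w_{\min} \le w(\theta) \le w_{\max}$. Then we have the following upper bounds:

$$\mathrm{Var}\left[\nabla_\eta \widehat{J}^{b^*}_{\mathrm{IW}}(\rho)\right] \le \frac{(1 - \gamma^T)^2 B}{N' (1 - \gamma)^2} \left(\beta^2 w_{\max} - \alpha^2 w_{\min}\right),$$

$$\mathrm{Var}\left[\nabla_\tau \widehat{J}^{b^*}_{\mathrm{IW}}(\rho)\right] \le \frac{2 (1 - \gamma^T)^2 B}{N' (1 - \gamma)^2} \left(\beta^2 w_{\max} - \alpha^2 w_{\min}\right).$$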



Comparing Theorem 1 and this corollary, we can see that the upper bounds for
IW-PGPE with the optimal constant baseline are smaller than those for IW-PGPE with no
baseline because $\alpha^2 w_{\min} > 0$. Although they are just upper bounds, they still show
intuitively that subtraction of the optimal constant baseline contributes to mitigating
the large variance caused by importance weighting. If $w_{\min}$ is larger, then the upper
bounds for IW-PGPE with the optimal constant baseline can be much smaller than those
for IW-PGPE with no baseline.



4 Experimental Results 

In this section, we experimentally investigate the usefulness of the proposed method,
importance-weighted PGPE with the optimal constant baseline (which we denote by
IW-PGPEob hereafter). In the experiments, we estimate the optimal constant baseline
using all collected data, as suggested in [8, 16, 29]. This approach introduces bias
into the method because the same sample set is used both for estimating the gradient
and the baseline. Another possibility is to split the data into two parts: One is used
for estimating the optimal constant baseline and the other is used for estimating the
gradient. However, we found in our preliminary experiments that this splitting approach
does not work well. The MATLAB implementation of IW-PGPEob is available from:
http://sugiyama-www.cs.titech.ac.jp/~tingting/software.html

4.1 Illustrative Example 

First, we illustrate the behavior of PGPE methods using a toy dataset. 
4.1.1 Setup 

The dynamics of the environment is defined as

$$s_{t+1} = s_t + a_t + \varepsilon,$$

where $s_t \in \mathbb{R}$, $a_t \in \mathbb{R}$, and $\varepsilon \sim \mathcal{N}(0, 0.5^2)$ is stochastic noise. The initial state $s_1$ is
randomly chosen from the standard normal distribution. The linear deterministic controller
is represented by $a_t = \theta s_t$ for $\theta \in \mathbb{R}$. The immediate reward function is given by

$$r(s_t, a_t) = \exp\left(-s_t^2/2 - a_t^2/2\right) + 1,$$

which is bounded in $(1, 2]$. In the toy dataset experiments, we always set the discount
factor at $\gamma = 0.9$, and we always use the adaptive learning rate $\varepsilon = 0.1/\|\nabla_\rho \widehat{J}(\rho)\|$ [11].
Here, we compare the following PGPE methods: 

• PGPE: Plain PGPE without data reuse [20].

• PGPEob: Plain PGPE with the optimal constant baseline without data reuse [32].

• NIW-PGPE: Data-reuse PGPE without importance weights.

• NIW-PGPEob: Data-reuse PGPEob without importance weights.

• IW-PGPE: Importance-weighted PGPE.

• IW-PGPEob: Importance-weighted PGPE with the optimal baseline.

Suppose that a small amount of samples consisting of $N$ trajectories with length $T$ is
available at each iteration. More specifically, given the hyper-parameter $\rho_L = (\eta_L, \tau_L)$ at
the $L$th iteration, we first choose the policy parameter $\theta_n^L$ from $p(\theta|\rho_L)$, and then run the
agent to generate trajectory $h_n^L$ according to $p(h|\theta_n^L)$. Initially, the agent starts from a
randomly selected state $s_1$ following the initial state probability density $p(s_1)$ and chooses an
action based on the policy $p(a_t|s_t, \theta_n^L)$. Then the agent makes a transition following the
dynamics of the environment $p(s_{t+1}|s_t, a_t)$ and receives a reward $r_t = r(s_t, a_t, s_{t+1})$. The
transition is repeated $T$ times to get a trajectory, which is denoted as $h_n^L = \{s_t, a_t, r_t, s_{t+1}\}_{t=1}^{T}$.
We repeat the procedure $N$ times, and the samples gathered at the $L$th iteration are
expressed as $D^L = \{(\theta_n^L, h_n^L)\}_{n=1}^{N}$.
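
The sampling procedure for the toy problem can be sketched as follows. This Python snippet is a minimal illustration that assumes the dynamics $s_{t+1} = s_t + a_t + \varepsilon$ with noise standard deviation 0.5, the linear controller $a_t = \theta s_t$, the reward $\exp(-s_t^2/2 - a_t^2/2) + 1$, and the discount factor 0.9 described in this section. The function names `rollout` and `collect_samples` are hypothetical, and only the return of each trajectory is stored, since this is all the PGPE estimators need.

```python
import math
import random


def rollout(theta, horizon=10, gamma=0.9, noise_std=0.5, rng=random):
    """Run one episode with the controller a_t = theta * s_t.

    Returns the discounted return sum_t gamma^(t-1) * r_t.
    """
    s = rng.gauss(0.0, 1.0)  # initial state from the standard normal
    ret = 0.0
    for t in range(horizon):
        a = theta * s
        r = math.exp(-(s * s) / 2.0 - (a * a) / 2.0) + 1.0
        ret += gamma ** t * r
        s = s + a + rng.gauss(0.0, noise_std)  # stochastic transition
    return ret


def collect_samples(eta, tau, n_traj=10, rng=random):
    """Draw theta_n ~ N(eta, tau^2) and one trajectory per theta_n.

    Returns a list of (theta_n, return_n) pairs.
    """
    samples = []
    for _ in range(n_traj):
        theta = rng.gauss(eta, tau)
        samples.append((theta, rollout(theta, rng=rng)))
    return samples
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when gradient estimates are compared across many trials.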

In the data-reuse methods, we estimate gradients at each iteration based on the current
data and all previously collected data $D^{1:L} = \{D^l\}_{l=1}^{L}$, and use the estimated gradients to
update the policy hyper-parameters (i.e., mean $\eta$ and standard deviation $\tau$). In the plain
PGPE method and the plain PGPEob method, we only use the on-policy data $D^L$ to
estimate the gradients at each iteration, and use the estimated gradients to update the policy
hyper-parameters. If the deviation parameter $\tau$ takes a value smaller than 0.05 during
the parameter-update process, we set it at 0.05.
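
One hyper-parameter update with the normalized step size and the lower limit on the deviation parameter can be sketched as below. This is an assumption-laden illustration: the function name `update_hyperparameters` is hypothetical, the parameter is scalar, the norm in the learning rate is taken over the full gradient with respect to both the mean and the deviation, and the guard for a zero gradient is added only to keep the example well defined.

```python
import math


def update_hyperparameters(eta, tau, grad_eta, grad_tau, step=0.1, tau_floor=0.05):
    """Gradient ascent step with learning rate step / ||gradient||.

    The deviation parameter tau is reset to tau_floor if it falls below it.
    """
    norm = math.hypot(grad_eta, grad_tau)
    if norm == 0.0:
        # No update direction available; only enforce the floor on tau.
        return eta, max(tau, tau_floor)
    lr = step / norm
    new_eta = eta + lr * grad_eta
    new_tau = max(tau + lr * grad_tau, tau_floor)
    return new_eta, new_tau
```

With this normalization every update moves the hyper-parameters by a distance of 0.1 before the floor is applied, so the gradient estimate only determines the direction of the step.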

Below, we experimentally evaluate the variance, bias, and mean squared error of the 
estimated gradients, trajectories of learned hyper-parameters, and obtained returns. 

4.1.2 Estimated Gradients 

We investigate how data reuse influences estimated gradients over iterations. Below, we
focus on gradients with respect to the mean parameter $\eta$.
We randomly choose the initial mean parameter $\eta$ from the standard normal distribution,
and fix the initial deviation parameter at $\tau = 1$. We collect $N = 10$ trajectories with
the trajectory length $T = 10$ at each iteration, and update hyper-parameters over 20
iterations. Here, the variance and squared bias of estimated gradients at each iteration
(e.g., at the $L$th iteration, $L = 1, \ldots, 20$) are investigated for $M = 10000$ trials:

$$\mathrm{Var} := \frac{1}{M} \sum_{m=1}^{M} \left\| \nabla_{\eta_L} \widehat{J}_m(\rho_L) - \frac{1}{M} \sum_{m'=1}^{M} \nabla_{\eta_L} \widehat{J}_{m'}(\rho_L) \right\|^2,$$

$$\mathrm{Bias}^2 := \left\| \frac{1}{M} \sum_{m=1}^{M} \nabla_{\eta_L} \widehat{J}_m(\rho_L) - \nabla_{\eta_L} J(\rho_L) \right\|^2,$$



where $\nabla_{\eta_L} \widehat{J}_m(\rho_L)$ is an estimated gradient in the $m$-th trial. More specifically, we
estimate the gradients $M$ times with different random seeds at the $L$th iteration as
follows: We generate samples $D_m^{1:L} = \{D_m^l\}_{l=1}^{L}$ following the corresponding distributions
$\{D_m^l \overset{\mathrm{i.i.d.}}{\sim} p(h, \theta|\rho_l)\}_{l=1}^{L}$ in each trial ($m = 1, \ldots, M$), and we estimate the gradient
$\nabla_{\eta_L} \widehat{J}_m(\rho_L)$ with the generated samples $D_m^{1:L}$. The variance and squared bias at the
$L$th iteration are calculated based on the estimated gradients from $M$ trials. In this
experiment, the true gradient $\nabla_{\eta_L} J(\rho_L)$ at the $L$th iteration is approximated by the plain
PGPE method using Eq.(3) with $N = 10000$ on-policy samples. Note that the sum of the
variance and squared bias agrees with the mean squared error:

$$\mathrm{Var} + \mathrm{Bias}^2 = \frac{1}{M} \sum_{m=1}^{M} \left\| \nabla_{\eta_L} \widehat{J}_m(\rho_L) - \nabla_{\eta_L} J(\rho_L) \right\|^2. \qquad (5)$$
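
The decomposition of the mean squared error into variance and squared bias is easy to check numerically. The following Python sketch does so for scalar gradient estimates (as in the toy problem, where the mean parameter is one-dimensional); the function name `variance_bias_mse` and the input layout are assumptions made for this example.

```python
def variance_bias_mse(estimates, true_grad):
    """Empirical variance, squared bias, and MSE of scalar gradient estimates.

    estimates: list of M gradient estimates, one per trial.
    true_grad: reference value of the true gradient.
    """
    m = len(estimates)
    mean = sum(estimates) / m
    var = sum((g - mean) ** 2 for g in estimates) / m
    bias_sq = (mean - true_grad) ** 2
    mse = sum((g - true_grad) ** 2 for g in estimates) / m
    return var, bias_sq, mse


# Example with three hypothetical estimates and a reference gradient of 2.0:
# the first two values returned sum to the third, up to rounding error.
var, bias_sq, mse = variance_bias_mse([1.0, 2.0, 4.0], 2.0)
```

The identity holds exactly because the variance is normalized by $M$ rather than $M - 1$; with the unbiased variance estimate the two sides would differ slightly.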



We update the hyper-parameters $\rho_L$ based on the estimated true gradient $\nabla_{\eta_L} J(\rho_L)$,
and obtain $\rho_{L+1}$. Then, we investigate the variance and bias at the next iteration, i.e.,
the $(L + 1)$th iteration, following the above procedures. Figure 1 shows the variance and
squared bias over 20 iterations.



From Figure 1(a), we can see that IW-PGPEob provides gradient estimates with the
lowest variance among the compared methods. IW-PGPE has a larger variance than
NIW-PGPE, which agrees well with our theoretical analysis: According to Theorem 1,
the upper bounds of the variance are proportional to the upper bound of the importance
weight, which is always 1 in NIW-PGPE, but can be very large in IW-PGPE if the target
distribution is significantly different from the sampling distribution. In order to see
whether the upper bound of importance weights is really large, we measure the maximum
value of importance weights over iterations, which is shown in Figure 2. Figure 2(a) shows
that the maximum value of importance weights tends to grow over iterations, which further
illustrates how importance weights influence the variance of gradient estimates in IW-PGPE.

We can also see that the gap in the variance between IW-PGPE and IW-PGPEob
tends to grow over iterations, which is also consistent with our theoretical analysis:
According to Theorem 3, the larger the importance weight is, the more the optimal
constant baseline contributes to reducing the variance. The importance weight may get
larger at later iterations, because distributions in the first and the last iterations may
be significantly different (Figure 2 illustrates exactly this phenomenon). Thus, variance
reduction from IW-PGPE to IW-PGPEob by the optimal constant baseline tends to be
more significant in later iterations. Gradient estimates in both NIW-PGPEob and
IW-PGPEob have smaller variance than those of the plain PGPEob method, because the more
data we use, the smaller the variance of gradient estimates becomes, as expected from
the theory. IW-PGPEob provides smaller variance than NIW-PGPEob, which is our
expected result: According to Theorem 3, if the importance weights are larger, using the
optimal constant baseline can reduce variance more, while the importance weights are
always 1 in NIW-PGPEob (see Figure 2(b)). The plain PGPEob has smaller variance
than the plain PGPE, which agrees well with the results reported in [32].

Figure 1: Variance and Bias$^2$ of gradient estimates with respect to the mean parameter
$\eta$ through parameter update iterations. Panels: (a) Variance, (b) Bias$^2$; horizontal axis: Iteration.

Figure 2: Average maximum values of importance weights over 20 runs through parameter
update iterations. Panels: (a) IW-PGPE, (b) IW-PGPEob; horizontal axis: Iteration.

Figure 1(b) shows that introduction of the optimal baseline does not increase the bias.
NIW-PGPE and NIW-PGPEob have very large bias, because naively reusing previous
data leads to an inconsistent and biased gradient estimator. The bias of gradient estimates
in IW-PGPE is fairly small, because IW-PGPE is not only consistent, but also unbiased.
The plain PGPE and plain PGPEob also have small bias, as expected.

Because our proposed IW-PGPEob has small bias and the smallest variance among
the compared methods, it also gives the smallest mean squared error (see Eq.(5)).



4.1.3 Hyper-Parameter Trajectories 

Next, we illustrate how learned hyper-parameters change over iterations. Here we compare
the behavior of the following three methods: NIW-PGPE, IW-PGPE, and our proposed
method IW-PGPEob. We fix the initial deviation parameter at $\tau = 1$, and test three
different initial mean parameters: $\eta = -1.6$, $-0.8$, and $-0.1$. Figure 3 depicts the contour
of the expected return, where the maximum of the return surface is located at the middle
bottom.

First, let us investigate how the hyper-parameters change over 20 iterations in a
large-sample case with $N = 10$. From Figure 3(a), we can see that NIW-PGPE cannot
properly update the solutions, which means that the inconsistency cannot be overcome
by increasing the number of samples. On the other hand, Figure 3(c) shows that
IW-PGPE can sometimes lead the solutions to an area with large returns, but cannot always
reach such an area after 20 iterations. This indicates that the consistency of
importance weighting tends to be helpful when the number of samples is large, but IW-PGPE
cannot converge rapidly because of the large variance. Figure 3(e) shows that IW-PGPEob
gives reliable update directions and the three paths converge rapidly to the vicinity
of the maximum point without detours. This shows that the optimal constant baseline
contributes greatly to improving the convergence property of IW-PGPE.



Next, we investigate the performance over 200 iterations with only $N = 1$. Figure 3(b)
shows that NIW-PGPE cannot properly update the solutions to the maximum point
because of the inconsistency, and Figure 3(d) shows that the IW-PGPE solutions cannot
always reach an area with large returns (middle bottom) after 200 iterations,
because the variance in IW-PGPE is critical in this extreme scenario. However, Figure 3(f)
shows that the proposed IW-PGPEob can still find fairly reliable update directions with
only $N = 1$.

Next, we investigate the directions of estimated gradients more systematically. We fix
the starting point at $\eta = -0.8$ and $\tau = 0.5$. The true gradient direction is calculated
by the plain PGPE method with 10000 on-policy samples. In this experiment, we first
collect $N' = 10$ off-policy samples, which are drawn from $\mathcal{N}(-1.6, 1)$. We then reuse these
off-policy samples to estimate the gradients in the data-reuse methods. We calculate the
gradients 20 times with different random seeds, and investigate the angle between the
true gradient and the estimated gradients. The results are summarized in Figure 4.



In Figure 4(a), the red line denotes the true gradient and blue lines are the estimated
gradients by the NIW-PGPE method. The histograms of angles between the true gradient
and the estimated gradients are plotted in Figure 4(b). The graph shows that the angles
are concentrated in $[-150, -90]$, which further explains the inconsistent property of the
NIW-PGPE method. Observing the angle distribution for IW-PGPE in Figure 4(d), we
can see that the angles are widely distributed in $[-180, 180]$, which clearly illustrates the
large variance problem of IW-PGPE. On the other hand, the angles for the IW-PGPEob
method are concentrated in $[-60, 60]$, which highlights the small variance and consistent
properties of IW-PGPEob.

4.1.4 Performance of Learned Policies 

Finally, we evaluate average expected returns obtained by each method over 20 runs. The
expected return at each trial is approximated using 100 newly-drawn test episodic data
(which are not used for policy learning). The initial mean parameter $\eta$ is chosen randomly
from the standard normal distribution, and the deviation parameter is fixed at $\tau = 1$.

Figure 5 shows that IW-PGPE_OB improves the performance over iterations and converges
very fast. The performance of NIW-PGPE is not largely improved over iterations,
which is caused by biased gradient estimates (see Figure 3(a) again). IW-PGPE works
better than NIW-PGPE, but the performance is saturated after 9 iterations. IW-PGPE_OB
does not outperform NIW-PGPE_OB by much in the first several iterations, because the
difference between the target distribution and the sampling distribution is not that large
at the beginning. However, the upper bound of importance weights tends to become
larger over iterations (see Figure 2(b) again), which makes IW-PGPE_OB more reliable
than NIW-PGPE_OB in the later iterations. The plain PGPE_OB method works fairly well
with N = 10 on-policy samples, but it is still not as good as IW-PGPE_OB.
Figure 3: Trajectories of policy hyper-parameters over iterations.

estimated gradients.

Figure 5: Average expected returns through policy update iterations over 20 runs for toy
data. Error bars denote standard errors.

4.2 Mountain Car

Next, we evaluate our proposed method in the mountain car task, which is illustrated in
Figure 6. The task consists of a car and two hills whose landscape is described as sin(3x).
The top of the right hill is the goal to which we want to guide the car.
We compare the following 7 methods:

• TIW-eNAC: Truncated importance-weight episodic natural actor-critic, which is
an episodic version of the sample-reuse NAC method [28, 17]. Following the same
line as [28], we truncate the importance weight as w = min{w, 2}.

• IW-REINFORCE_OB: Importance-weighted REINFORCE with the optimal baseline,
which is basically a combination of the off-policy implementation of the episodic
REINFORCE method [12] and the optimal baseline [16], although we could not find
exactly this method in the literature.

• R³: Reward-weighted regression with sample reuse [9].

• PGPE_OB: Plain PGPE_OB without data reuse.

• NIW-PGPE_OB: Data-reuse PGPE_OB without importance weighting.

• IW-PGPE: Importance-weighted PGPE.

• IW-PGPE_OB: Importance-weighted PGPE with the optimal baseline.

The state space $\mathcal{S}$ is two-dimensional and continuous, and consists of the horizontal
position $x$ [m] $\in [-1.2, 0.5]$ and the velocity $\dot{x}$ [m/s] $\in [-1.5, 1.5]$, i.e., $s = (x, \dot{x})^\top$. This is
non-linearly transformed to a feature space via a basis function vector $\phi(s)$. We use 12
Gaussian kernels with mean $c$ and standard deviation $\kappa = 1$ as the basis functions,

$$\phi(s) = \exp\left(-\frac{\|s - c\|^2}{2\kappa^2}\right),$$

where the kernel centers $c$ are distributed over the following grid points:

$$\{-1.2, -0.35, 0.5\} \times \{-1.5, -0.5, 0.5, 1.5\}.$$
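As a minimal sketch of the feature map above, the 12 Gaussian kernels can be evaluated as follows. The names `CENTERS`, `KAPPA`, and `features` are illustrative and do not come from the paper; only the grid points and the kernel width are taken from the text.

```python
import itertools
import numpy as np

# Kernel centers: the 3 x 4 grid over (position, velocity) given in the text.
POSITIONS = [-1.2, -0.35, 0.5]
VELOCITIES = [-1.5, -0.5, 0.5, 1.5]
CENTERS = np.array(list(itertools.product(POSITIONS, VELOCITIES)))  # shape (12, 2)
KAPPA = 1.0  # kernel standard deviation

def features(state, centers=CENTERS, kappa=KAPPA):
    """Return the 12-dimensional Gaussian-kernel feature vector of a state."""
    s = np.asarray(state, dtype=float)
    sq_dist = np.sum((centers - s) ** 2, axis=1)  # squared distance to each center
    return np.exp(-sq_dist / (2.0 * kappa ** 2))

phi = features([-0.5, 0.0])  # features of a car at x = -0.5 with zero velocity
```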
The action space $\mathcal{A}$ is one-dimensional and continuous, and corresponds to the force
applied to the car (note that the force of the car is not strong enough to climb up the slope
to directly reach the goal). We use the Gaussian policy model for IW-REINFORCE_OB,
TIW-eNAC, and R³:

$$p(a \mid s, \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(a - \mu^\top \phi(s))^2}{2\sigma^2}\right),$$

where $\mu$ is the mean policy parameter and $\sigma$ is the deviation policy parameter. We
employ a linear deterministic policy model for the PGPE methods, which corresponds
to the Gaussian policy above with $\sigma \to 0$.
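The difference between the two policy models can be illustrated with a short sketch. It assumes, as in the PGPE setting, that the policy parameter is drawn from an independent Gaussian with mean `eta` and deviation `tau`; all function names here are hypothetical.

```python
import numpy as np

def gaussian_policy_action(mu, sigma, phi, rng):
    """Stochastic policy: draw an action from N(mu^T phi, sigma^2) at every step."""
    return rng.normal(loc=float(np.dot(mu, phi)), scale=sigma)

def sample_pgpe_parameter(eta, tau, rng):
    """PGPE-style exploration: draw one policy parameter per trajectory.

    Assumes an independent Gaussian over each parameter dimension.
    """
    return rng.normal(loc=eta, scale=tau)

def deterministic_policy_action(theta, phi):
    """Linear deterministic policy: the action is fully determined by the state."""
    return float(np.dot(theta, phi))

rng = np.random.default_rng(0)
phi = np.array([0.2, 0.5, 1.0])          # some feature vector
theta = sample_pgpe_parameter(np.zeros(3), np.ones(3), rng)
action = deterministic_policy_action(theta, phi)
```

In the first model the randomness enters at every time step, whereas in the second it enters once per trajectory through `theta`.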

The dynamics of the car (i.e., the update rules of the position and the velocity) are
given by

$$x_{t+1} = x_t + \dot{x}_{t+1} \Delta t,$$

$$\dot{x}_{t+1} = \dot{x}_t + \left(-9.8 w \cos(3 x_t) + \frac{a_t}{w} - k \dot{x}_t\right) \Delta t,$$

where $a_t$ is the action taken at time $t$. We set the problem parameters as follows: the
mass of the car $w = 0.2$ [kg], the friction coefficient $k = 0.3$, and the simulation time step
$\Delta t = 0.1$ [s]. The reward function is defined as



$$r(s_t, a_t, s_{t+1}) = \begin{cases} 1 & \text{if } x_{t+1} > 0.45, \\ -1 & \text{otherwise.} \end{cases}$$
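The update rules and the goal condition above can be sketched as a single simulation step. This is a minimal illustration, not the authors' simulator: the function names are hypothetical, and any clipping of the state to its stated range is omitted because it is not described in the text.

```python
import math

MASS = 0.2       # mass of the car w [kg]
FRICTION = 0.3   # friction coefficient k
DT = 0.1         # simulation time step [s]
GOAL_X = 0.45    # position threshold used in the reward function

def mountain_car_step(x, x_dot, action, mass=MASS, friction=FRICTION, dt=DT):
    """Advance the position and velocity by one time step."""
    accel = -9.8 * mass * math.cos(3.0 * x) + action / mass - friction * x_dot
    x_dot_next = x_dot + accel * dt
    x_next = x + x_dot_next * dt   # the position uses the updated velocity
    return x_next, x_dot_next

def reached_goal(x_next, goal_x=GOAL_X):
    """True when the next position exceeds the goal threshold."""
    return x_next > goal_x

# At the bottom of the valley of sin(3x), i.e. x = -pi/6, the slope term vanishes.
x1, v1 = mountain_car_step(-math.pi / 6.0, 0.0, action=0.0)
```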



The initial mean parameter $\eta$ is chosen randomly from the standard normal distribution,
and the initial deviation parameter is set at $\tau = 1$. The initial state of the car is
set at the bottom of the mountain with the velocity $\dot{x} = 0$. The agent collects $N = 10$
episodic samples with trajectory length $T = 40$ at each iteration. In the data-reuse methods,
we reuse all previous data at later iterations. In the plain PGPE_OB method, we
just use $N = 10$ on-policy samples at each iteration to estimate policy gradients. The
discount factor is set at $\gamma = 0.95$. The learning rate is $\varepsilon = 1/\|\nabla_\rho \hat{J}(\rho)\|$.
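With a learning rate equal to the reciprocal of the gradient norm, every update moves the hyper-parameters by a step of fixed length along the estimated gradient direction. The sketch below assumes a plain gradient-ascent update; the function name, the `scale` argument, and the guard against a zero gradient are illustrative additions, not part of the paper.

```python
import numpy as np

def normalized_gradient_step(rho, grad, scale=1.0, eps=1e-12):
    """Gradient-ascent update with learning rate scale / ||grad||.

    The resulting step always has length `scale`, whatever the gradient norm.
    """
    rho = np.asarray(rho, dtype=float)
    grad = np.asarray(grad, dtype=float)
    norm = np.linalg.norm(grad)
    if norm < eps:          # assumed safeguard: leave rho unchanged
        return rho.copy()
    return rho + (scale / norm) * grad

rho_new = normalized_gradient_step([0.0, 1.0], [3.0, 4.0])  # unit-length step
```

Because only the direction of the estimate matters under this rule, errors in the gradient direction translate directly into wasted updates.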

We investigate average expected returns over 10 trials as functions of policy-update
iterations. The expected return at each trial is computed over 100 newly-drawn test
episodic samples (which are not used for policy learning). The experimental results are
plotted in Figure 7. This shows that IW-PGPE_OB improves the performance very fast
over policy-update iterations, and it achieves a larger performance improvement than
all other methods. IW-PGPE also improves the performance well over iterations,
implying that the consistency of the IW estimator is useful in this task. However, it is
outperformed by the proposed IW-PGPE_OB, perhaps because the estimation variance in
IW-PGPE is large. NIW-PGPE_OB performs fairly well, which may be because the bias
of policy gradient estimators is not that crucial in this experiment. The plain PGPE_OB
can improve the performance throughout the iterations, which indicates that N = 10
on-policy samples are enough for this mountain-car task. The other data-reuse methods can
improve the performance over iterations, but slowly, and they are outperformed by the
compared PGPE methods. IW-REINFORCE_OB outperforms TIW-eNAC, which may be
because the optimal constant baseline contributes significantly in IW-REINFORCE_OB and
truncating the importance weights can lead to a larger bias over iterations in TIW-eNAC.
R³ cannot improve the performance over iterations. Overall, thanks to the low variance,
IW-PGPE_OB achieves smooth and fast policy improvement throughout iterations, and its
performance is the best among the compared methods.

Figure 6: Mountain car.

Figure 7: Average expected returns over 10 runs as functions of the number of iterations
for the mountain-car task. Error bars are standard errors.

4.3 Upper-body Humanoid Control 

Finally, we evaluate the performance of our proposed method on a highly nonlinear dynamic
control problem of the simulated upper-body model of the humanoid robot CB-i
[I] (see Figure 8(a)). We use its simulator in our experiments (see Figure 8(b)). The goal
is to lead the end-effector of the right arm (right hand) to a target object.

(a) CB-i (b) Simulated upper-body model

Figure 8: Humanoid robot CB-i and its upper-body model.

4.3.1 Setup 

We compare the performance of the following 4 methods: 

• IW-REINFORCE_OB: Importance-weighted REINFORCE with the optimal baseline.

• NIW-PGPE_OB: Data-reuse PGPE_OB without importance weighting.

• PGPE_OB: Plain PGPE_OB without data reuse.

• IW-PGPE_OB: Importance-weighted PGPE with the optimal baseline.

The simulation is based on the upper body of the CB-i humanoid robot illustrated in
Figure 8(b), which has 9 degrees of freedom corresponding to the main joints of the upper
body: the shoulder pitch, shoulder roll, and elbow pitch of the right arm; the shoulder pitch,
shoulder roll, and elbow pitch of the left arm; and the waist yaw, torso roll, and torso pitch.

At each time step, the controller receives states from the system and sends out actions. 
The state space is 18-dimensional, which corresponds to the current angle and the current 
angular velocity of each joint. The action space is 9-dimensional, which corresponds to 
the target angle of each joint. Both states and actions are continuous. 

The initial positions of the robot and the object are fixed: the robot initially stands up
straight with its arms down, and the position of the target object depends on the task.
Note that the position of the target object is only used in designing the reward function.
The reward function is given by

$$r_t = k_1 \exp(-10 d_t) - k_2 \min\{c_t, 10000\},$$

where $k_1 = 1$, $k_2 = 0.0005$, $d_t$ is the distance between the robot's right hand and the target
object at time step $t$, and $c_t$ is the sum of control costs for each joint. Note that the
results may change with different $k_1$ and $k_2$ for the reward function. In order to keep
the values of $\exp(-10 d_t)$ and $c_t$ in the reward function at the same order of magnitude,
we need to choose $k_1$ and $k_2$ reasonably. We use the same policy model as in the mountain
car experiment, i.e., the linear deterministic policy for PGPE and the Gaussian policy for
IW-REINFORCE_OB, with the basis function $\phi(s) = s$.
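The reward above trades off closeness to the target against capped control cost. A minimal sketch, with a hypothetical function name and the constants taken from the text:

```python
import math

def reaching_reward(distance, control_cost, k1=1.0, k2=0.0005, cost_cap=10000.0):
    """Reward for one time step of the reaching task.

    distance:     distance between the right hand and the target object
    control_cost: sum of control costs over the joints, capped at cost_cap
    """
    return k1 * math.exp(-10.0 * distance) - k2 * min(control_cost, cost_cap)

r_touching = reaching_reward(distance=0.0, control_cost=0.0)
r_capped = reaching_reward(distance=0.0, control_cost=20000.0)
```

With these constants the first term lies in (0, 1] and the penalty term is at most 5, which is the balance that the choice of the two coefficients controls.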

The initial mean parameter $\eta$ is randomly chosen from the standard normal distribution,
and the initial standard deviation parameter $\tau$ is set to 1. To evaluate the usefulness
of the data-reuse methods with a small number of samples, the agent collects only $N = 3$
on-policy samples with trajectory length $T = 100$ at each iteration. In the data-reuse
methods, we reuse all previous data at later iterations. In the plain PGPE_OB, we just use
the on-policy samples to estimate the gradients. The discount factor is set at $\gamma = 0.9$,
and the learning rate is set at $\varepsilon = 0.1/\|\nabla_\rho \hat{J}(\rho)\|$.



4.3.2 Reaching Task with 2 Degrees of Freedom 

First, we investigate the performance on the reaching task with only 2 degrees of freedom. 
We fix the body of the robot and use only the right shoulder pitch and right elbow pitch. 
Figure 9 depicts the averaged expected return over 10 trials as a function of the number
of iterations. The expected return at each trial is computed from 50 newly-drawn test
episodic data (which are not used for policy learning). The graph shows that IW-PGPE_OB
nicely improves the performance over iterations with only a small number of on-policy
samples. The plain PGPE_OB can also improve the performance over iterations, but slowly.
NIW-PGPE_OB is not as good as IW-PGPE_OB, especially at the later iterations,
because of the inconsistent property of the NIW estimator. The initial mean parameter
is randomly chosen in this experiment, which makes IW-REINFORCE_OB unable to
improve the performance significantly over iterations. This result is consistent with the
observation that the REINFORCE method is sensitive to the initial parameter values [32].

The distance from the right hand to the object and the control costs along the trajectory
are also investigated. We test the initial policy, the policy obtained at the 20th
iteration by IW-PGPE_OB, and the policy obtained at the 50th iteration by IW-PGPE_OB.
The results are shown in Figure 10. Figure 10(a) clearly shows that the policy
obtained at the 50th iteration decreases the distance faster than the initial
policy and the policy obtained at the 20th iteration. This means that the robot can reach
the object quickly by using the learned policy. On the other hand, Figure 10(b) shows that
the control cost required for executing the policy obtained at the 50th iteration decreases
steadily until the reaching task is completed. This is because the robot mainly adjusts
the shoulder pitch in the beginning, which consumes a larger amount of energy than the
energy required for controlling the elbow pitch. Then, once the right hand gets closer to
the target object, the robot starts to adjust the elbow pitch to reach the target object. The
policy obtained at the 20th iteration actually incurs a lower control cost, but it cannot
move the arm to the target object.

Figure 9: Average expected returns over 10 runs as functions of the number of iterations
for the reaching task with 2 degrees of freedom (right shoulder pitch and right elbow
pitch).

Figure 10: Distance and control costs of arm reaching with 2 degrees of freedom using
the policy learned by IW-PGPE_OB.



Figure 11 shows a typical solution of the reaching task with 2 degrees of freedom by
IW-PGPE_OB (with the policy obtained at the 50th iteration). The images show that the
policy learned by our proposed method successfully leads the right hand to the target
object within only 10 time steps.



4.3.3 Reaching Task with 4 Degrees of Freedom 

Next, we evaluate the performance on the reaching task with 4 degrees of freedom. We
use the right shoulder pitch, right elbow pitch, right shoulder roll, and torso yaw joint.
By using the torso yaw joint, the robot can reach a distant object that cannot be
reached by using only the right arm. The results are shown in Figure 12. The graph
shows that IW-PGPE_OB achieves fast policy improvement throughout iterations, and the
performance is the best among the compared methods.

Figure 11: Typical example of arm reaching with 2 degrees of freedom using the policy
obtained by IW-PGPE_OB at the 50th iteration.

Figure 13 depicts a representative example of object reaching with 4 degrees of freedom
by IW-PGPE_OB. Note that the object is distant from the robot and it cannot be reached
by using only the right arm. The robot first adjusts the torso yaw joint, and then uses the
right arm to reach the object. The images show that the policy learned by our proposed
method successfully leads the right hand to the distant object.

4.3.4 Reaching Task with All Degrees of Freedom 

Finally, we evaluate the performance on the reaching task with all degrees of freedom. The
position of the target object is the same as in the 4-degrees-of-freedom setting.

In this experiment, we use all degrees of freedom to reach the object. This increases
the dimensionality of the state space, which may cause the values of importance
weights to grow exponentially [22], [3]. In order to mitigate the large values of importance
weights, we decided not to reuse all previously collected samples, but only the samples
collected in the last 5 iterations. This allows us to keep the difference between the sampling
distribution and the target distribution reasonably small, and thus the values of importance
weights can be suppressed to some extent. Furthermore, following [28], we truncate the
importance weights as w = min{w, 2}. This version of IW-PGPE_OB is denoted as
Truncated IW-PGPE_OB below.
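The two safeguards described above (a sliding window over recent iterations and truncation of the weights at 2) can be sketched as follows. The sketch assumes that the importance weight is the ratio of the target to the sampling density of the policy parameter, and that both are independent Gaussians parameterized by a mean and a deviation; the function names are hypothetical.

```python
import numpy as np

def log_gaussian(thetas, eta, tau):
    """Log-density of independent Gaussians N(eta, tau^2), summed over dimensions."""
    thetas = np.asarray(thetas, dtype=float)
    return np.sum(-0.5 * np.log(2.0 * np.pi * tau ** 2)
                  - (thetas - eta) ** 2 / (2.0 * tau ** 2), axis=-1)

def truncated_importance_weights(thetas, target, sampling, cap=2.0):
    """Weights p(theta | target) / p(theta | sampling), truncated at `cap`."""
    log_w = log_gaussian(thetas, *target) - log_gaussian(thetas, *sampling)
    return np.minimum(np.exp(log_w), cap)

def recent_batches(batches, window=5):
    """Keep only the sample batches from the last `window` iterations."""
    return batches[-window:]

rng = np.random.default_rng(0)
sampling = (np.zeros(4), np.ones(4))        # (eta', tau') used to draw the samples
target = (0.5 * np.ones(4), np.ones(4))     # (eta, tau) of the current policy
thetas = rng.normal(sampling[0], sampling[1], size=(100, 4))
weights = truncated_importance_weights(thetas, target, sampling)
```

Truncation introduces bias in exchange for bounded weights, which is the trade-off the text refers to.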

The results are shown in Figure 14. The graph shows that the performance of Truncated
IW-PGPE_OB is the best, which implies that the truncation of importance weights
is helpful when applying our proposed method to high-dimensional problems.



Figure 12: Average expected returns over 10 runs as functions of the number of iterations
for the reaching task with 4 degrees of freedom (right shoulder pitch, right elbow pitch,
right shoulder roll, and torso yaw joint).

Figure 13: Typical example of arm reaching with 4 degrees of freedom using the policy
obtained by IW-PGPE_OB at the 50th iteration.

Figure 14: Average expected returns over 10 runs as functions of the number of iterations
for the reaching task with all degrees of freedom.

Through all the arm-reaching experiments, we can see that the returns tend to be lower
as the dimension is increased, even though we run the higher-dimensional experiment for
a larger number of iterations. In the task with all degrees of freedom (Figure 14), the
largest number of iterations is 400. If we continue the experiment for more iterations, the
returns may slightly increase, but are still less than the returns in the low-dimensional
experiments. This is because the more joints the robot uses, the more energy is
consumed, and thus the returns tend to be lower in high-dimensional cases.

Overall, the proposed IW-PGPE_OB is shown to be a promising method, although the
last experiment shows that, just like other importance-weight-based methods,
its performance degrades in high-dimensional problems without the use of additional
correction techniques such as weight truncation.

5 Discussions and Conclusions 

In many real-world reinforcement learning problems, reducing the number of training 
samples is desirable because the sampling cost is often much higher than the computa- 
tional cost. In this paper, we proposed a new policy gradient method equipped with 
efficient sample reuse, which systematically combines a reliable policy gradient method, 
PGPE, with importance sampling and the optimal constant baseline. We showed that 
the introduction of the optimal constant baseline can mitigate the large-variance problem
of importance weighting under some conditions. Through experiments with an artificial
domain, the usefulness of the proposed method was demonstrated. Moreover, through
robotic experiments, we found that the truncation technique was helpful when applying
the proposed method to high-dimensional problems.
The low variance of PGPE was brought by considering a deterministic policy and
introducing the stochasticity by drawing a policy parameter from a prior distribution.
This per-trajectory formulation was indeed shown to be useful in reducing the variance
of policy gradient estimates. However, PGPE has limitations, too. For example, the
use of a finite horizon is essential in PGPE, because the gradient estimates need full
trajectories. In particular, it is not straightforward to handle the infinite-horizon case.
Another issue is an extension to the partially observable case. It is known that for every
finite Markov decision problem (MDP) there exists a deterministic policy that is optimal
[19]. However, in a partially observable MDP (POMDP), the best stationary stochastic
policy can be arbitrarily better than the best stationary deterministic policy [23]. Thus,
the deterministic policy in PGPE can be a limitation when extending it to the POMDP
framework. It is trivial to extend the current formulation to consider stochastic policies.
However, this may lead to an increase of variance and thus slow down convergence. These
issues need to be further investigated in future work.

The baseline and importance weighting techniques are two independent techniques.
More specifically, importance weighting is used in the off-policy scenario to efficiently reuse
previously collected samples: it corrects for the difference between the data sampling
distribution and the target distribution, so that the estimator remains consistent. On the
other hand, the optimal constant baseline is used to reduce the variance of gradient estimates.

The use of a baseline technique was first proposed in terms of reinforcement
comparison in [23], which intuitively means the comparison between the expected return $R$
and the baseline $b$: if $R > b$, we adjust the learned parameters $\rho$ so as to increase the
probability of $\theta$, and, if $R < b$, we do the opposite. Based on this idea, Williams [30]
demonstrated that a baseline technique does not introduce bias, because the expectation of
the coefficient of $b$ is zero, i.e.,

$$\mathbb{E}\left[\frac{\nabla_\rho p(\theta \mid \rho)}{p(\theta \mid \rho)}\right] = \int \nabla_\rho p(\theta \mid \rho) \, d\theta = \nabla_\rho \int p(\theta \mid \rho) \, d\theta = 0.$$

The effect of the baseline on variance is considered in [6]. The intuition behind the
baseline is that subtracting a baseline from the return reduces the magnitude, and thus
reduces the variance. Technically, subtracting a baseline can be viewed as a control
variate technique [7], which is an effective approach to reducing the variance of Monte
Carlo estimates of integrals. The experimental results in this paper suggest that
subtracting the baseline is possibly the primary factor in improving performance,
compared with the importance weighting technique.
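The effect of a baseline can be illustrated numerically with a small on-policy experiment. Everything in this sketch is a toy assumption made for illustration: a one-dimensional policy parameter drawn from a Gaussian with mean `eta` and deviation `tau`, a made-up return function, and a baseline computed as the ratio of two sample averages (return times squared score, over squared score).

```python
import numpy as np

def score_eta(thetas, eta, tau):
    """Derivative of log N(theta; eta, tau^2) with respect to eta."""
    return (thetas - eta) / tau ** 2

def gradient_samples(thetas, returns, eta, tau, baseline=0.0):
    """Per-sample gradient estimates (R - b) * d/d_eta log p(theta | eta, tau)."""
    return (returns - baseline) * score_eta(thetas, eta, tau)

def variance_minimizing_baseline(thetas, returns, eta, tau):
    """Sample estimate of E[R * score^2] / E[score^2]."""
    sq = score_eta(thetas, eta, tau) ** 2
    return np.sum(returns * sq) / np.sum(sq)

rng = np.random.default_rng(0)
eta, tau = 0.5, 1.0
thetas = rng.normal(eta, tau, size=20000)
returns = 10.0 - thetas ** 2                 # toy return with a large constant offset

b = variance_minimizing_baseline(thetas, returns, eta, tau)
g_plain = gradient_samples(thetas, returns, eta, tau)
g_base = gradient_samples(thetas, returns, eta, tau, baseline=b)
print(g_plain.mean(), g_base.mean())         # both estimate the same gradient
print(g_plain.var(), g_base.var())           # the baseline version varies far less
```

For this toy return the exact gradient with respect to the mean is -2 * eta = -1, so both sample means should land near -1 while their variances differ by roughly an order of magnitude.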

In episodic policy gradient methods, the optimal baseline which does not bias policy
gradient estimates is given by a single scalar for all trajectories [16]. However, in
non-episodic policy gradient methods, the optimal baseline can depend on the current
state [U [T7j. Thus, if a good parameterization for the baseline is known, e.g., in
a generalized linear form $b(s_t) = w^\top \phi(s_t)$, this can significantly improve the gradient
estimation process. However, the selection of the basis function can be difficult and often
impractical in robotics [16]. On the other hand, it is interesting to see that if the value
function is used as the baseline function in non-episodic policy gradient methods, such as
in [IH [26], the term $Q(s, a) - V(s)$ will lead to the advantage function [2], where $Q(s, a)$
is the action-value function and $V(s)$ is the value function.
Appendix

In the appendix, we give proofs of the theorems.

A Proof of Theorem 1

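Proof. Because the sampled data $\{(\theta'_n, h'_n)\}_{n=1}^{N'}$ are independent and identically
distributed, we have

$$\mathrm{Var}\left[\nabla_\eta \hat{J}_{\mathrm{IW}}(\rho)\right] = \frac{1}{N'} \mathrm{Var}\left[w(\theta) \nabla_\eta \log p(\theta \mid \rho) R(h)\right], \qquad (7)$$

where $h$ and $\theta$ are random variables that follow the distribution $p(h, \theta \mid \rho')$.

Note that we consider the trace of the covariance matrix of gradient vectors, that is,
the sum of the variances of the components of the vector. Then, by upper-bounding the
variance with the second moment, we have the following upper bound:

$$\begin{aligned}
\mathrm{Var}\left[w(\theta) R(h) \nabla_\eta \log p(\theta \mid \rho)\right]
&\le \sum_{i=1}^{\ell} \mathbb{E}_{p(h, \theta \mid \rho')}\left[\left(w(\theta) R(h) \nabla_{\eta_i} \log p(\theta \mid \rho)\right)^2\right] \\
&= \sum_{i=1}^{\ell} \iint p(h \mid \theta) p(\theta \mid \rho') \left(w(\theta) R(h) \nabla_{\eta_i} \log p(\theta \mid \rho)\right)^2 dh \, d\theta \\
&= \sum_{i=1}^{\ell} \iint p(h \mid \theta) p(\theta \mid \rho) w(\theta) (R(h))^2 \left(\nabla_{\eta_i} \log p(\theta \mid \rho)\right)^2 dh \, d\theta \\
&\le \frac{\beta^2 (1 - \gamma^T)^2 w_{\max}}{(1 - \gamma)^2} \sum_{i=1}^{\ell} \iint p(h \mid \theta) p(\theta \mid \rho) \left(\nabla_{\eta_i} \log p(\theta \mid \rho)\right)^2 dh \, d\theta \\
&= \frac{\beta^2 (1 - \gamma^T)^2 w_{\max}}{(1 - \gamma)^2} \sum_{i=1}^{\ell} \mathbb{E}_{p(\theta \mid \rho)}\left[\left(\nabla_{\eta_i} \log p(\theta \mid \rho)\right)^2\right],
\end{aligned}$$

where $\mathbb{E}_{p(\theta \mid \rho)}[\cdot]$ denotes the expectation of a function of the random variable $\theta$ with
respect to $\theta \sim p(\theta \mid \rho)$. Subsequently, given the proof of the first part of Theorem 1 in [32], we
get the upper bound of $\mathrm{Var}[\nabla_\eta \hat{J}_{\mathrm{IW}}(\rho)]$.

Similarly, given the same technique and the proof of the latter part of Theorem 1 in
[32], we get the upper bound of $\mathrm{Var}[\nabla_\tau \hat{J}_{\mathrm{IW}}(\rho)]$. □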



B Proof of Theorem 2 



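Proof. First, let us derive some elementary expressions. Let $A$, $C$ be random variables
taking values in the $\ell$-dimensional space and let $b$ be a scalar. Then,

$$\mathrm{Var}[A - bC] = \mathrm{Var}[A] + b^2 \mathrm{Var}[C] - b \, \mathrm{Cov}[A, C] - b \, \mathrm{Cov}[C, A].$$

We still consider the trace of the covariance matrix of gradient vectors for the
multi-dimensional space. Assume that $\mathbb{E}[C] = 0$. Then, we have

$$\begin{aligned}
\mathrm{Var}[A - bC] &= \mathrm{Var}[A] + b^2 \mathrm{Var}[C] - 2b \, \mathrm{Cov}[A, C] \\
&= \mathrm{Var}[A] + \mathbb{E}[C^\top C] \left(b^2 - 2b \frac{\mathbb{E}[A^\top C]}{\mathbb{E}[C^\top C]}\right) \\
&= \mathrm{Var}[A] + \mathbb{E}[C^\top C] \left(\left(b - \frac{\mathbb{E}[A^\top C]}{\mathbb{E}[C^\top C]}\right)^2 - \left(\frac{\mathbb{E}[A^\top C]}{\mathbb{E}[C^\top C]}\right)^2\right). \qquad (8)
\end{aligned}$$

Simple calculus shows that the foregoing is minimized when

$$b = \frac{\mathbb{E}[A^\top C]}{\mathbb{E}[C^\top C]}.$$

The optimal baseline for IW-PGPE follows immediately by plugging in

$$A = R \, w \, \nabla_\rho \log p(\theta \mid \rho)$$

and

$$C = w \, \nabla_\rho \log p(\theta \mid \rho)$$

for $A$ and $C$. Note that Eq.(8) uses the fact that $\mathbb{E}[w \nabla_\rho \log p(\theta \mid \rho)] = 0$, which can
be found in the proof of Theorem 4 in [32].

As the sampled data are independent and identically distributed, we have

$$\mathrm{Var}\left[\nabla_\rho \hat{J}^{b}_{\mathrm{IW}}(\rho)\right] = \frac{1}{N'} \mathrm{Var}[A - bC].$$

Then, according to Eq.(8) and the definition of $b^*$, we have

$$\begin{aligned}
\mathrm{Var}\left[\nabla_\rho \hat{J}^{b}_{\mathrm{IW}}(\rho)\right] - \mathrm{Var}\left[\nabla_\rho \hat{J}^{b^*}_{\mathrm{IW}}(\rho)\right]
&= \frac{1}{N'} \left(b^2 \mathbb{E}[C^\top C] - 2b \, \mathbb{E}[A^\top C] + \frac{(\mathbb{E}[A^\top C])^2}{\mathbb{E}[C^\top C]}\right) \\
&= \frac{1}{N'} (b - b^*)^2 \, \mathbb{E}[C^\top C],
\end{aligned}$$

where the expectation is over random variables $h$ and $\theta$ such that $(h, \theta) \sim p(h, \theta \mid \rho')$.
This completes the proof of Theorem 2. □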
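The final identity of this proof, that the excess variance of a baseline $b$ over the optimal one equals $(b - b^*)^2 \, \mathbb{E}[C^\top C]$ up to the factor $1/N'$, can be checked numerically. The sketch below works with empirical moments of synthetic samples and centers $C$ so that its sample mean is exactly zero, matching the assumption $\mathbb{E}[C] = 0$; the helper names are hypothetical, and the $1/N'$ factor is omitted because it multiplies both sides.

```python
import numpy as np

def trace_var(X):
    """Trace of the empirical covariance of the rows of X (shape (n, d))."""
    return float(np.mean(np.sum(X ** 2, axis=1)) - np.sum(np.mean(X, axis=0) ** 2))

def baseline_gap(A, C, b):
    """Return (lhs, rhs, b_star) for Var[A - bC] - Var[A - b*C] = (b - b*)^2 E[C^T C]."""
    m_cc = float(np.mean(np.sum(C * C, axis=1)))   # empirical E[C^T C]
    m_ac = float(np.mean(np.sum(A * C, axis=1)))   # empirical E[A^T C]
    b_star = m_ac / m_cc
    lhs = trace_var(A - b * C) - trace_var(A - b_star * C)
    rhs = (b - b_star) ** 2 * m_cc
    return lhs, rhs, b_star

rng = np.random.default_rng(0)
C = rng.normal(size=(500, 3))
C -= C.mean(axis=0)                        # enforce a zero sample mean for C
A = rng.normal(size=(500, 3)) + 2.0 * C    # A is correlated with C

lhs, rhs, b_star = baseline_gap(A, C, b=0.0)
```

Setting `b=0.0` corresponds to using no baseline, so `lhs` is the variance reduction obtained by switching to the optimal baseline.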



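C Proof of Theorem 3

Proof. We define $\nabla_\eta$ and $\nabla_{\eta_i}$ as

$$\nabla_\eta = \nabla_\eta \log p(\theta \mid \rho), \qquad \nabla_{\eta_i} = \nabla_{\eta_i} \log p(\theta \mid \rho).$$

As before, the subscript $\rho'$ denotes the distribution $p(h, \theta \mid \rho')$. According to Theorem 2,
by setting $b = 0$, it is easy to see that

$$\mathrm{Var}\left[\nabla_\eta \hat{J}_{\mathrm{IW}}(\rho)\right] - \mathrm{Var}\left[\nabla_\eta \hat{J}^{b^*}_{\mathrm{IW}}(\rho)\right] = \frac{(b^*)^2}{N'} \mathbb{E}_{\rho'}\left[w^2(\theta) \nabla_\eta^\top \nabla_\eta\right].$$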



We already know that 



Hence, 



Var 



V r , l 7i W (p) 



Var 



V„Jrw(P 



jV'(l -7) 2 

2 "in; 



^ /3 2 (1~7 
" AT'(l- 7 ) ? 

_ /? 2 (1 - j T ) 2 B 
N'(l - 7) 2 



i=i 



(9) 
(10) 



where Eq.(9) is based on the same technique used in Section [S] and Eq.(10) is given by
results of the proof of Theorem 1 in
Similarly, we can have the lower bound as
By using the same techniques, we get the bounds of the variance reduction of gradient
estimation with respect to the deviation parameter $\tau$,
which completes the proof.